{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 正则表达式\n",
    "正则表达式是**匹配字符串**的工具\n",
    "\n",
    "设计思想:\n",
    "使用描述性语言给字符串定义一个规则，符合规则的字符串被认定为匹配，否则字符串不合法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "匹配符合其功能:\n",
    "\n",
    "|匹配符|功能|示例|\n",
    "|:-|:-|:-|\n",
    "|\\d|匹配一个数字|  00\\d --- 007|\n",
    "|\\w|匹配一个数字或字母| \\w\\w\\d --- py3|\n",
    "|\\s|匹配一个空格(包括Tab等空白符)| a\\sa --- a  a|\n",
    "|.|匹配任意字符| py. --- pyc|\n",
    "|*| 表示任意个字符(包括0个)| \\d* --- 000|\n",
    "|+|表示至少1个字符| \\d+ --- 7|\n",
    "|?|表示0个或1个字符| \\d? --- 7|\n",
    "|{n}|表示n个字符| \\d{3} --- 123|\n",
    "|{n,m}|表示n~m个字符| \\d{1,7} --- 456|\n",
    "|-|特殊字符需要\\\\转义|\\\\- --- -|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "更精确的匹配使用`[]`划分范围，范围里的表达式至少有一个成立:\n",
    "- `[0-9a-zA-Z\\_]`匹配一个数字、字母或下划线(可选的，满足一个就行)\n",
    "- `[0-9a-zA-Z\\_]+`匹配至少一个数字、字母或下划线组成的字符串，如a100、_py、Py3，同样`+`位置可以使用`*`，`{n,m}`等来控制字符串的个数\n",
    "- `[a-zA-Z\\_][0-9a-zA-Z\\_]*`匹配字母或下划线开头，后接任意个数数字、字母或下划线组成的字符串，命名的规则\n",
    "- `A|B`可以匹配A或B，所以(P|p)ython可以匹配'python'或'Python'，用`()`可以分组\n",
    "- `^`表示行的开头，`$`表示行的结尾，所以`^\\d`表示以数字开头，`\\d$`表示以数字结尾，`^py$`只能匹配`py`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. re模块匹配\n",
    "re模块包含所以正则表达式的功能，`re.match(r'Regular Expression', test_str)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.使用转义符的结果: ABC\\-001\n",
      "2.使用raw字符串: ABC\\-001\n"
     ]
    }
   ],
   "source": [
    "# 注意python字符串总 \\ 也要用转义\n",
    "s = 'ABC\\\\-001'   # 对应正则表达式的ABC\\-001，\\用来对-转义\n",
    "print('1.使用转义符的结果:', s)\n",
    "# 使用r'string'来解决该问题\n",
    "s = r'ABC\\-001'\n",
    "print('2.使用raw字符串:',s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.匹配成功: <_sre.SRE_Match object; span=(0, 10), match='010-123456'> <class '_sre.SRE_Match'>\n",
      "1.匹配成功: None <class 'NoneType'>\n"
     ]
    }
   ],
   "source": [
    "# 使用re模块进行字符串匹配\n",
    "import re \n",
    "result = re.match(r'^\\d{3}\\-\\d{3,8}$', '010-123456') # 开头3个数字 + '-' + 结尾3~8个数字\n",
    "print('1.匹配成功:', result, type(result))     # 匹配成功后返回一个mathc对象\n",
    "result = re.match(r'^\\d{3}\\-\\d{3,8}$', '010 12345') \n",
    "print('1.匹配成功:', result, type(result))     # 匹配失败后返回None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.字符串开头要有使正则表达式成立的部分: 0a\n"
     ]
    }
   ],
   "source": [
    "result = re.match(r'[a-z0-9]{2}', '0abc')\n",
    "print('1.字符串开头要有使正则表达式成立的部分:', result[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "failed\n"
     ]
    }
   ],
   "source": [
    "# 匹配的步骤:\n",
    "import re\n",
    "test = '用户输入的字符串'\n",
    "if re.match(r'正则表达式', test):\n",
    "    print('ok')\n",
    "else:\n",
    "    print('failed')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. 切分字符串\n",
    "`re.split(r'Regular Expression', test_str)`任意切分字符串"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.一般字符串的切分结果: ['a', 'b', '', '', 'c']\n"
     ]
    }
   ],
   "source": [
    "split_list = 'a b   c'.split(' ')\n",
    "print('1.一般字符串的切分结果:', split_list)          # 无法切分连续的空格"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.正则表达式切分字符串的结果: ['a', 'b', 'c']\n",
      "2.正则表达式切分字符串的结果: ['a', 'b', 'c', 'd']\n",
      "3.正则表达式切分字符串的结果: ['a', 'b', 'c', 'd']\n"
     ]
    }
   ],
   "source": [
    "# 1. 使用正则表达式切分字符串\n",
    "split_list = re.split(r'[\\s]+', 'a b   c')          # 使用至少1个空格的字符串来切分，这里的[]可以省略\n",
    "print('1.正则表达式切分字符串的结果:', split_list) \n",
    "split_list = re.split(r'[\\s\\,]+', 'a,b, c , d')     # 使用至少1个空格或1个逗号，组成的字符串\n",
    "print('2.正则表达式切分字符串的结果:', split_list) \n",
    "split_list = re.split(r'[\\s\\,\\;]+', 'a,b;; c  d')   # 使用至少1个空格的字符串来切分 \n",
    "print('3.正则表达式切分字符串的结果:', split_list) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. 分组\n",
    "`()`定义要提取的分组，`group()`方法提取子串"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.分组的结果:\n",
      "('010', '12345')\n",
      "010-12345\n",
      "010\n",
      "12345\n"
     ]
    }
   ],
   "source": [
    "result = re.match(r'^(\\d{3})-(\\d{3,8})$', '010-12345')  # ^和 $ 要在()外部，这里共定义了两个组\n",
    "print('1.分组的结果:')\n",
    "print(result.groups())\n",
    "print(result.group(0))   # 0表示所有的字符串，1表示第一组，2表示第二组，用()标注的代表一个组\n",
    "print(result.group(1))\n",
    "print(result.group(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('19', '05', '30')\n"
     ]
    }
   ],
   "source": [
    "# 识别时间的例子\n",
    "t = '19:05:30'\n",
    "expression = r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\\:\\\n",
    "(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\\:\\\n",
    "(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$'   # 注意\\:\\第二个\\是为了换行用的，写一行就去掉它\n",
    "# | 管道符表示可以取任意一个\n",
    "result = re.match(expression, t)\n",
    "print(result.groups())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. 贪婪匹配\n",
    "正则匹配默认是贪婪匹配，尽可能多的匹配字符"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('102300', '')"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = re.match(r'^(\\d+)(0*)$', '102300')    # \\d+把1023后的00也匹配到了，所以0*没有匹配的字符串\n",
    "result.groups()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用`?`可以使匹配符采用非贪婪匹配法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('1123', '00')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = re.match(r'^(\\d+?)(0*)$', '112300')    # \\d+?只匹配了1023\n",
    "result.groups()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. 编译\n",
    "1. 编译正则表达式，如果字符串本身不合法，会报错\n",
    "2. 用编译后的正则表达式匹配字符串"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class '_sre.SRE_Pattern'>\n",
      "('010', '12345')\n",
      "('010', '8086')\n"
     ]
    }
   ],
   "source": [
    "# 可以预先编译正则表达式，然后直接匹配即可\n",
    "re_telephone = re.compile(r'^(\\d{3})\\-(\\d{3,8})$')\n",
    "print(type(re_telephone))\n",
    "print(re_telephone.match('010-12345').groups())   # 编译后的表达式可以直接匹配\n",
    "print(re_telephone.match('010-8086').groups())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "正则表达式练习"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ok\n"
     ]
    }
   ],
   "source": [
    "# 1. 写一个验证Email地址的正则表达式，someone@gmail.com，bill.gates@microsoft.com\n",
    "import re\n",
    "def is_valid_email(addr):\n",
    "#     expression = r'^([a-zA-Z\\.]+)@([a-zA-Z]+)(\\.com)'    # 这里的\\.直接写 . 也可以@不是特殊字符，不用转义\n",
    "    expression = r'^(\\w+\\.?\\w+)@(\\w+)(\\.\\w+)'\n",
    "    result = re.match(expression, addr)\n",
    "#     print(result)\n",
    "    return True if result else False\n",
    "\n",
    "assert is_valid_email('someone@gmail.com')\n",
    "assert is_valid_email('bill.gates@microsoft.com')\n",
    "assert not is_valid_email('bob#example.com')\n",
    "assert not is_valid_email('mr-bob@example.com')\n",
    "print('ok')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('Tom Paris',)\n",
      "('tom',)\n",
      "ok\n"
     ]
    }
   ],
   "source": [
    "# 2. 以提取出带名字的Email地址，<Tom Paris> tom@voyager.org => Tom Paris，bob@example.com => bob\n",
    "def name_of_email(addr):\n",
    "#     expression = r'^<?(\\w+\\s*\\w+)?>?\\s*\\w*@\\w+\\.\\w+$'   # 太难看了，但能用\n",
    "    expression = r'<*(\\w*\\s*\\w*)>*\\s*\\w*@\\w+\\.\\w+'        # 可用, *表示任意个，包括0\n",
    "    result = re.match(expression, addr)   \n",
    "    print(result.groups())\n",
    "    return (result.group(1))   \n",
    "assert name_of_email('<Tom Paris> tom@voyager.org') == 'Tom Paris'\n",
    "assert name_of_email('tom@voyager.org') == 'tom'\n",
    "print('ok')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.匹配成功: <_sre.SRE_Match object; span=(0, 2), match='a '> <class '_sre.SRE_Match'>\n"
     ]
    }
   ],
   "source": [
    "result = re.match(r'^\\w+\\s*\\s*', 'a = ')         # 匹配开头字母与结尾的空格\n",
    "print('1.匹配成功:', result, type(result))        # 输出匹配到的字符串部分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "小结:\n",
    "- 正则匹配时，字符串中需要匹配的字符要在正则表达式中，否则会报错`TypeError: 'NoneType' object is not subscriptable`\n",
    "- 正则表达式处理字符串从左边开始\n",
    "- 字符串从左边起只要满足正则表达式，就算匹配成功，不需要字符串的所有部分匹配"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
